Concept drift refers to a non-stationary learning problem over time. The training and the application
data often mismatch in real-life problems [61].

In this report we present the context of the concept drift problem¹. We focus on the issues relevant to
adaptive training set formation. We present the framework and terminology, and formulate a global
picture of concept drift learner design.

We start with formalizing the framework for concept drifting data in Section 1. In Section 2
we discuss the adaptivity mechanisms of concept drift learners. In Section 3 we overview the
principal mechanisms of concept drift learners. In this chapter we give a general picture of the
available algorithms and categorize them based on their properties. Section 4 discusses the related
research fields and Section 5 groups and presents major concept drift applications.

This report is intended to give a bird's-eye view of the concept drift research field, provide a
context for the research and position it within the broad spectrum of research fields and applications.

1 Framework and Terminology 




For analyzing the problem of training set formation under concept drift, we adopt the following 
framework. 

A sequence of instances is observed, one instance at a time, not necessarily in equally spaced time
intervals. Let $X_t \in \mathbb{R}^p$ be a vector in $p$-dimensional feature space observed at time $t$
and $y_t$ the corresponding label. For classification $y_t \in \mathbb{Z}^1$, for prediction
$y_t \in \mathbb{R}^1$. We call $X_t$ an instance and a pair $(X_t, y_t)$ a labeled instance. We refer to
instances $(X_1, \ldots, X_t)$ as historical data and instance $X_{t+1}$ as the target (or testing) instance.



1.1 Incremental Learning with Concept Drift 

We use an incremental learning framework. At every time step $t$ we have labeled historical data
available, $X^H = (X_1, \ldots, X_t)$. A target instance $X_{t+1}$ arrives. The task is to predict its
label $y_{t+1}$. For that we build a learner $C_t$, using all or a selection of the available historical
data $X^H$. We apply the learner $C_t$ to predict the label for $X_{t+1}$. The prediction process at time
step $t$ is illustrated in Figure 1. That is for one time step.

Figure 1: One time step (t) of the incremental learning process

¹ This is a working version, the categorization is in progress. The latest version of the report is available online:
http://zliobaite.googlepages.com/Zliobaite_CDoverview.pdf . Feedback is very welcome

At the next step, after the classification or prediction decision is made, the label $y_{t+1}$ becomes
available. Now the instance $X_{t+1}$ with its label is a part of the historical data. The next testing
instance $X_{t+2}$ is observed. We picture a fragment of the incremental learning loop in Figure 2. The
classifier training phase at time $t$ is zoomed in. Training set formation strategies are the subject of
our investigation. They are depicted as a 'black box' in the figure.
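
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the report's implementation: the names `select_training_set`, `MajorityLearner` and `incremental_loop` are hypothetical, the 'black box' of training set formation is stood in for by a fixed-size sliding window, and the learner is a toy majority-class predictor.

```python
from collections import Counter


def select_training_set(history, window=5):
    """Stand-in for the training set formation 'black box' (assumption):
    keep the `window` most recent labeled instances."""
    return history[-window:]


class MajorityLearner:
    """Toy learner C_t: predicts the most frequent label of its training set."""

    def fit(self, training_set):
        labels = [y for _, y in training_set]
        self.label = Counter(labels).most_common(1)[0][0] if labels else None
        return self

    def predict(self, x):
        return self.label


def incremental_loop(stream, window=5):
    history, predictions = [], []
    for x_next, y_next in stream:
        # Build C_t from (a selection of) the historical data.
        learner = MajorityLearner().fit(select_training_set(history, window))
        # Predict the label of the target instance before it is revealed.
        predictions.append(learner.predict(x_next))
        # The true label becomes available; the instance joins the history.
        history.append((x_next, y_next))
    return predictions


# Six instances of class 0 followed by six of class 1 (a sudden change).
stream = [([0.0], 0)] * 6 + [([1.0], 1)] * 6
predictions = incremental_loop(stream, window=3)
```

With a window of three instances the toy learner needs two labeled examples of the new class before its prediction switches, which illustrates why the choice of training set matters after a change.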

Every instance $X_t$ is generated by a source $S_t$. We delay a more formal definition of a source until
the next section; for now assume that it is a distribution over the data. If all the data is sampled
from the same source, i.e. $S_1 = S_2 = \cdots = S_{t+1} = S$, we say that the concept is stable. If for
any two time points $i$ and $j$ $S_i \neq S_j$, we say that there is a concept drift.

Note that random noise (deviation) is not considered to be a concept drift, because the data
generating source is still the same.

The core assumption when dealing with the concept drift problem is uncertainty about
the future. We assume that the source of the target instance $X_{t+1}$ is not known with certainty.
It can be assumed, estimated or predicted, but there is no certainty. Otherwise the data can be
decomposed into two separate data sets and learned as individual models or in a combined manner
(then it is a multitask learning problem [9]).

We do not consider periodic seasonality as a concept drift problem. But if seasonality is not known
with certainty, we consider it a concept drift problem. For instance, a peak in sales of ice cream is
associated with summer, but it can start at a different time every year depending on the temperature
and other factors, therefore it is not known exactly when the peak will start.

Figure 2: Incremental learning process



1.2 Causes of a concept drift 

Before looking at what can actually cause the drift, let us return to the source $S_t$ and provide a
more rigorous definition of it.

A classification problem, independently of the presence or absence of concept drift, may be described
as follows [112]. Let $X \in \mathbb{R}^p$ be an instance in $p$-dimensional feature space, with
$X \in c_i$, where $c_1, c_2, \ldots, c_k$ is the set of class labels. The optimal classifier to classify
$X \to c_i$ is completely determined by the prior probabilities for the classes $P(c_i)$ and the
class-conditional probability density functions (pdf) $p(X|c_i)$, $i = 1, \ldots, k$.

We define a set of prior probabilities of the classes and class-conditional pdfs as a concept or data
source:

$$S = \{(P(c_1), p(X|c_1)), (P(c_2), p(X|c_2)), \ldots, (P(c_k), p(X|c_k))\}. \tag{1}$$

When referring to a particular source at time $t$ we will use the term source, while when referring to
a fixed set of prior probabilities of the classes and class-conditional pdfs we will use the term concept
and denote it $S$.

Recall that in Bayesian decision theory [32] the classification decision for instance $X$ at equal costs
of mistake is made based on the maximal a posteriori probability, which for a class $c_i$ is:

$$p(c_i|X) = \frac{P(c_i)\, p(X|c_i)}{p(X)}, \tag{2}$$

where $p(X)$ is the evidence of $X$, which is constant for all the classes $c_i$.
As first presented by Kelly et al. [76], concept drift may occur in three ways.

1. Class priors P(c) might change over time. 

2. The distributions of one or several classes p(X|c) might change. 

3. The posterior distributions of the class memberships p(c|X) might change. 

Note that the distributions $p(X|c)$ might change in such a way that the class membership is not
affected (e.g. symmetric movement in opposite directions).

Sometimes a change in $p(X|c)$ (independently of whether it affects $p(c|X)$ or not) is referred to as
virtual drift and a change in $p(c|X)$ is referred to as real drift [153]. We argue that from a practical
point of view it is not essential whether the drift is real or virtual, since $p(c|X)$ depends on $p(X|c)$
as in Equation (2). From now on in this thesis we do not make a distinction between real and virtual drifts.
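
A small numerical sketch may help to see the difference between these cases. It assumes two classes with one-dimensional Gaussian class-conditional densities of equal variance; the function names and all parameter values are hypothetical and chosen only for illustration. The class means are first moved symmetrically in opposite directions, which leaves the decisions unchanged, and then the class priors are changed, which moves the decision boundary.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(x, priors, means, std=1.0):
    """Posterior p(c_i | x) via Bayes' rule for Gaussian class-conditionals."""
    joint = [p * gaussian_pdf(x, m, std) for p, m in zip(priors, means)]
    evidence = sum(joint)  # p(x), the same for every class
    return [j / evidence for j in joint]


def decide(x, priors, means, std=1.0):
    """Index of the class with maximal a posteriori probability."""
    post = posterior(x, priors, means, std)
    return max(range(len(post)), key=post.__getitem__)


# Evaluation grid on [-5, 5]; x = 0 is skipped because it is an exact tie.
grid = [i / 10.0 for i in range(-50, 51) if i != 0]

before = [decide(x, [0.5, 0.5], [-1.0, 1.0]) for x in grid]
# p(X|c) changes: both means move away from each other symmetrically.
symmetric_move = [decide(x, [0.5, 0.5], [-2.0, 2.0]) for x in grid]
# P(c) changes: the first class becomes much more frequent.
prior_change = [decide(x, [0.9, 0.1], [-1.0, 1.0]) for x in grid]

changed_by_move = sum(a != b for a, b in zip(before, symmetric_move))
changed_by_prior = sum(a != b for a, b in zip(before, prior_change))
```

In this setting `changed_by_move` is zero, while `changed_by_prior` is positive because the boundary shifts towards the less frequent class.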

2 How Do Concept Drift Learners Work? 

Following the framework, which was set up in the previous section, the learner should provide the
most accurate generalization for the data at time $t + 1$. In order to build such a learner, four main
design sub-problems need to be solved.

A.1 Future assumption: a designer needs to make an assumption about the future data source $S_{t+1}$.

A.2 Change type: a designer needs to identify possible change patterns.

A.3 Learner adaptivity: based on the change type and the future assumption a designer chooses
the mechanisms which make the learner adaptive.

A.4 Model selection: a designer needs a criterion to choose a particular parametrization of the
selected learner at every time step (e.g. the weights for ensemble members, the window
size for a variable window method).

All these sub-problems are choices to be made when designing a learner. In Figure 3 we depict
the position of each design sub-problem within the established learning framework.

In the next subsections we discuss each of the design sub-problems individually. 




Figure 3: Sub-problems of the concept drift learner design.



2.1 Future assumption 

Future assumption is the assumption to be made about the source $S_{t+1}$ of the target instance $X_{t+1}$.
We identify three types of choices here.

1. Assuming that $S_{t+1} = S_t$.

2. Estimating the source based on $X_{t+1}$.

3. Predicting the change.

The first option, assuming $S_{t+1} = S_t$, is the most common among concept drift learners, although it
is rarely explicitly stated. It is assumed that in the nearest future we will see the data coming from
the same source as we saw in the near past.

The second option utilizes information from the unlabeled target instance $X_{t+1}$. Estimation of the
source is usually done by measuring the distance between $X_{t+1}$ and historical reference instances.
The algorithms presented in [148] I123| [371 EZ] use this future assumption. 

Generally, in the concept drift problem the future data source is not known with certainty. However,
there are methods that use trainable prediction rules to estimate the future state and incorporate
that estimate into the incremental learning process. The algorithms using future predictions are
presented in [H E57J HSU HQ . 
2.2 Change types



In Section 1.2 we identified the causes of a drift, or what happens to the data generating source
itself. Here by change types we mean the configuration patterns of the data sources over time. The
structural types of change are usually defined based on those configurations.

For an intuitive explanation, let us for now restrict the number of possible sources over time to two:
$S_I$ and $S_{II}$.

The simplest pattern of a change is sudden drift, when at time $t_0$ a source $S_I$ is suddenly replaced
by source $S_{II}$. For example, Kate is reading the news. Her sudden interest in meat prices in New
Zealand, when she gets an assignment to write an article, is a sudden drift.

Gradual drift is another type often met in the literature. However, in fact there are two types being
mixed under this term. The first type of gradual drift refers to a period when both sources $S_I$
and $S_{II}$ are active (e.g. [141], [154], [112]). As time passes, the probability of sampling from
source $S_I$ decreases and the probability of sampling from source $S_{II}$ increases. Note that at the
beginning of this gradual drift, before more instances are seen, an instance from the source $S_{II}$
might easily be mixed up with random noise.

Another type of drift, also referred to as gradual, includes more than two sources; however, the
difference between the sources is very small, thus the drift is noticed only when looking at a longer
time period (e.g. |148|, I57| , I5"]). We refer to the former type of gradual drift as gradual and the
latter type of drift as incremental (or stepwise). For example, Kate's increasing interest in real estate
is a gradual drift: she prefers real estate news more and more over time as her interest in buying a
flat increases.

Finally, there is another big type of drift referred to as reoccurring context. That is when a previously
active concept reappears after some time. It differs from the common notion of seasonality in that
it is not certainly periodic; it is not clear when the source might reappear. In Kate's example these
are the biographies of Formula 1 drivers. The interest is related to the schedule of the races. But
she does not look up the biographies at the time of the races, because she is watching them at the
time. She might want to look them up later, in the middle of the week. And the particular drivers
she will be interested in might depend on who won the races this time.

In Figure 4 we give an illustration of the main structural drift types, assuming one-dimensional
data, where a source is characterized by the mean of the data. We depict only the data from one
class.
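
The four structural types can also be sketched as functions that return the mean of the active source at time $t$ for one-dimensional data. The function names, the two means `m1` and `m2` and all time points below are hypothetical values chosen for illustration; they are not taken from the figure.

```python
import random


def sudden(t, t0=50, m1=0.0, m2=1.0):
    """S_I is replaced by S_II at time t0."""
    return m1 if t < t0 else m2


def incremental(t, start=30, end=70, m1=0.0, m2=1.0):
    """Many intermediate sources: the mean moves in small steps from m1 to m2."""
    if t < start:
        return m1
    if t >= end:
        return m2
    return m1 + (m2 - m1) * (t - start) / (end - start)


def gradual(t, rng, start=30, end=70, m1=0.0, m2=1.0):
    """Both sources are active: the probability of sampling from S_II
    rises from 0 to 1 between `start` and `end`."""
    if t < start:
        p_second = 0.0
    elif t >= end:
        p_second = 1.0
    else:
        p_second = (t - start) / (end - start)
    return m2 if rng.random() < p_second else m1


def reoccurring(t, switch_points=(25, 45, 80), m1=0.0, m2=1.0):
    """S_I and S_II alternate at irregular, not strictly periodic, times."""
    switches = sum(1 for s in switch_points if t >= s)
    return m1 if switches % 2 == 0 else m2


rng = random.Random(0)
gradual_means = [gradual(t, rng) for t in range(100)]
incremental_means = [incremental(t) for t in range(100)]
```

In the gradual case every single value still equals one of the two source means, whereas in the incremental case the intermediate values belong to neither of the two end sources.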

Note that the types of drift discussed here are not exhaustive. If we think of a data segment of
length $t$ and just two data generating sources $S_I$ and $S_{II}$, the number of possible combinations of
the sources (that is, possible change patterns) would be $2^t$, which grows exponentially with $t$.
Moreover, in concept drift research it is often assumed that the data stream is endless, thus there
could be an infinite number of possible change patterns. We define the major structural types, since we
argue that an assumption about the change types is absolutely needed for designing adaptivity strategies.
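
The count of possible change patterns can be checked by brute-force enumeration for a short segment. The helper name `change_patterns` and the source labels are assumptions used only for this illustration.

```python
from itertools import product


def change_patterns(t, sources=("S_I", "S_II")):
    """All assignments of a source to each of t time steps."""
    return list(product(sources, repeat=t))


patterns = change_patterns(3)
n_patterns = len(patterns)  # equals 2 ** 3
```

Already for a segment of length 20 the same enumeration would yield more than a million patterns, which is why only a few qualitative types are distinguished.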

Recently there has been an attempt to categorize change types into mutually exclusive categories
[108] based on the number of reoccurrences, severity, speed and predictability. In principle the proposed
categorization tries to quantify the main aspects of the learner design process into a change
categorization. We argue that the categories cannot be mutually exclusive, because the change frequency
count, speed and severity are relative to the length of the subsequence one is looking at. Thus we
restrict our categorization to very few qualitative categories. We reserve predictability as a part of
the learner design process (future assumption), not the change itself.

Figure 4: Illustration of the four structural types of the drift.



2.3 Learner adaptivity 

We identify four main adaptivity areas: 



1. Base learners can be adaptive (e.g. configuration of decision tree nodes [68]).

2. Parametrization of the learners can be adaptive (e.g. weighting training samples in support
vector machines [81]).

3. Adaptive training set formation (e.g. training windows, instance selection) can be employed,
which is the scope and focus of this thesis. Training set formation can be decomposed into

• training set selection,

• training set manipulation (e.g. bootstrapping, noise),

• feature set manipulation.

4. Fusion rules of the ensembles ([142], [150], [141]).



The adaptivity strategies which are based on training set selection can generally be
divided into windowing (selecting training instances consecutive in time) and instance selection
(when instances that are not necessarily sequential in time are selected as a training set). The choice
of adaptivity strategy strongly depends on the assumption about the change type, discussed in the
previous section. For sudden drift windowing strategies are generally preferred, while for gradual drift
and reoccurring contexts instance selection strategies are preferred.

Figure 5: The design process of a concept drift learner.
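
The contrast between the two selection strategies can be sketched as follows. Both function names are hypothetical, and using Euclidean distance to the target instance as the selection criterion is an assumption; it is only one of many possible instance selection rules.

```python
def window_selection(history, size):
    """Windowing: the `size` most recent labeled instances, consecutive in time."""
    return history[-size:]


def instance_selection(history, target, k):
    """Instance selection: the k labeled instances closest to the target
    in feature space, regardless of when they were observed."""

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    ranked = sorted(history, key=lambda inst: squared_distance(inst[0], target))
    return ranked[:k]


# An old concept (values near 0, label 'a') followed by a new one (near 5, label 'b').
history = [([0.1], "a"), ([0.2], "a"), ([5.0], "b"), ([5.1], "b"), ([5.2], "b")]
target = [0.0]  # the old concept reoccurs

recent = window_selection(history, 2)
nearest = instance_selection(history, target, 2)
```

Here the window contains only instances of the new concept, while instance selection recovers the two old instances that resemble the target, which is the behaviour wanted for reoccurring contexts.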



2.4 Model selection 

In this thesis we use the generalization error as the primary measure of concept drift learner
performance. Thus for model selection (training) purposes the procedure for estimating the expected
generalization error for the target instance $X_{t+1}$ at every time step needs to be defined. The two
main options are:

1. theoretical evaluation of the generalization error, and

2. estimation of the generalization error using cross validation.

In any case the choice of error estimation is strongly related to the future assumption, because it
depends on the expectation regarding the future data source $S_{t+1}$.
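
As an illustration of how error estimation and the future assumption interact, the sketch below selects a training window size by the error measured on the most recent labeled instances. This relies on the assumption $S_{t+1} = S_t$, so that recent errors are taken as a proxy for the error on the target instance. It is a simple hold-out style estimate rather than cross validation proper; the function names and the majority-class learner are hypothetical.

```python
from collections import Counter


def majority_label(training_set):
    labels = [y for _, y in training_set]
    return Counter(labels).most_common(1)[0][0] if labels else None


def recent_error(history, window, n_eval):
    """Error of a majority learner with the given window size on the last
    n_eval instances, each predicted from the data that preceded it."""
    errors = 0
    for i in range(len(history) - n_eval, len(history)):
        train = history[max(0, i - window):i]
        errors += majority_label(train) != history[i][1]
    return errors / n_eval


def select_window(history, candidates, n_eval):
    """Pick the candidate window size with the lowest recent error."""
    return min(candidates, key=lambda w: recent_error(history, w, n_eval))


# Twenty instances of class 0 followed by six instances of class 1.
history = [([0.0], 0)] * 20 + [([1.0], 1)] * 6
best_window = select_window(history, candidates=(2, 10, 20), n_eval=4)
```

After the change only the shortest window predicts the recent instances correctly, so it is the one selected.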

The design process of a concept drift learner is graphically illustrated in Figure 5(a). We see
relations (1) and (2) as key issues in designing concept drift learners: (1) the strategies selected
to make the learners adaptive strongly depend on the assumption about the change type
present in the data; (2) the model selection and evaluation strategies strongly depend on the
assumption about the future data source on which the learner will be applied.






It is common to categorize concept drift learners into two major groups: 

1. learner adaptivity is initiated by a trigger (or active change detector), and 

2. a learner regularly evolves independently of the alarms or detectors. 

The two categories can be positioned within the design framework we just defined. In the first
group the initiation of learner adaptivity comes from the 'change type' block, while in the second
group the adaptivity is based on the 'model evaluation and selection' block. The process is illustrated
in Figure 5(b).

We will give more details about the categories of the drift learners in the next section, where we 
overview the related work. 

3 Taxonomy of Available Concept Drift Learners 

In this section we overview and map the related work. This section is intended to give a general
view; the approaches specifically related to our work will be presented in the corresponding chapters.
The overview concentrates on supervised learning under concept drift.

Schlimmer and Granger [136] in 1986 formulated the problem of incremental learning from noisy
data and presented an adaptive learning algorithm, STAGGER. They are the authors of the term
'concept drift'. Since then a number of studies dealing with the concept drift problem have appeared.
There were three 'peaks' in interest: one around 1998, followed by a special issue of the Machine Learning
journal [38], another around 2004, followed by a special issue of the Intelligent Data Analysis journal
[88]. The third 'peak' started around 2007 and continues to date, as a result of increasing loads
of streaming data and computational resources. Several PhD theses have directly addressed the
problem of concept drift [E2 E31 122 ESI ED].

The learners responsive to a concept drift can be divided into two big groups based on when
the adaptivity is 'switched on'. They are either trigger based or evolving. Trigger based means
that there is a signal which indicates a need for model change. The trigger directly influences
how the new model should be constructed. Most often change detectors are employed as triggers.
The evolving methods, on the contrary, do not maintain an explicit link between the data progress
and model construction and usually do not detect changes. They aim to build the most accurate
classifier either by maintaining the ensemble weights or by prototyping mechanisms. They usually keep
a set of alternative models, and the models for a particular time point are selected based on their
performance estimation. This is the 'when' dimension in the taxonomy.

Another dimension for grouping concept drift learners is based on how the learners adapt. What are
the actual adaptation mechanisms? The mechanisms were discussed following the design assumption
A.3 presented in Section 2. Generally the adaptation mechanisms are either related to training set
formation or to the design and parametrization of the base learner.

Based on those two dimensions we overview the main methodological contributions available in the
literature. The taxonomy is graphically presented in Figure 6. The positions of popular techniques
(our interpretation) are indicated by ellipses.

Figure 6: A taxonomy of adaptive supervised learning techniques.



3.1 Evolving learners 

We start by overviewing the evolving techniques. Some of the techniques discussed here employ
change detection mechanisms; still, these are not the triggers of adaptation ('detect and cut'), but
rather a tool to reduce computational complexity. First we discuss ensemble techniques, which
make up the largest group, and then other evolving techniques.



3.1.1 Adaptive ensembles 

The most popular evolving technique for handling concept drift is classifier ensemble. Classification 
outputs of several models are combined or selected to get a final decision. The combination or 
selection rules are often called fusion rules. 

There is a number of ensembles for concept drift, where the ideas are not specific to particular type 
of base learners (although some studies are limited to testing one base learner) [531 H41| 11421 H5U| 
H351 [721 1371 [JJ QI3 [EMJ HI Q231 ED]. There are also base learner specific ensembles. In those 
classifier combination rules usually depend on the base learner specific parameters of the learned
models: [81], [79] with SVM, [157] with Gaussian mixture models, [127] with perceptrons, [95] with
kNN.

In both cases adaptivity is achieved by fusion rules, i.e. how the weights are assigned to the
individual model outputs at each point in time. In a discrete case the output of a single model
might be selected. In this case all models except one get zero weights. The weight indicates the
'competence' of a base learner expected in the 'nearest future' (future assumption A.1). The
weight is usually a function of the historical performance in the past [831 QUI Q321 USB E21 13 QUE ES3 [T5T],
of the performance estimated using selective cross validation (148|, 11371 1151} HH [§S], or of base
learner specific performance estimates [81], [79], [157], [127]. Historical evaluation is restricted to
sudden and incremental drifts, while cross validation allows taking into account gradual drifts and
reoccurring contexts.
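
A fusion rule based on historical performance can be sketched as below. The function names are hypothetical, and taking the accuracy over the last `recent` labeled instances as the weight is an assumption; the cited methods use a variety of weighting functions.

```python
def weights_from_history(correct_flags, recent=10):
    """Weight of each ensemble member: its accuracy on the last `recent`
    labeled instances (1 = correct prediction, 0 = mistake)."""
    weights = []
    for flags in correct_flags:
        tail = flags[-recent:]
        weights.append(sum(tail) / len(tail) if tail else 0.0)
    return weights


def weighted_vote(predictions, weights):
    """Combine member outputs: the label with the largest total weight wins."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def select_best(predictions, weights):
    """Discrete case: all members except the best one get zero weight."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return predictions[best]


# Track record of three members on the last six labeled instances.
correct_flags = [
    [1, 1, 0, 0, 0, 0],  # was good before the change
    [0, 0, 1, 1, 1, 1],  # is good after the change
    [1, 0, 1, 0, 1, 0],
]
weights = weights_from_history(correct_flags, recent=4)
fused = weighted_vote(["a", "b", "a"], weights)
chosen = select_best(["a", "b", "a"], weights)
```

Although two of the three members vote for label "a", the member with the best recent track record dominates the weighted vote.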

In adaptive ensemble learners much attention is drawn to model evaluation and fusion rules (A.4),
while little attention is drawn to model construction (A.3). Still, there are a number of options
for building diverse base classifiers. Usually the implicit aim is to have at least one classifier in
the ensemble trained for each distinct concept. This can be achieved using different training set
selection strategies.

The straightforward approach is to divide historical data into blocks, which include instances
sequential in time. Often these blocks are non-overlapping jUS USUI EH HS1 1721 I5TI 1791 195],
sometimes overlapping [33]. These techniques are suitable for sudden and to some extent for
incremental drifts; they favor reoccurring contexts. Another approach is using training windows of
different sizes [83], [141], [137], [127], which implicitly assumes that a one-off sudden drift has
happened. Training windows are overlapping sequential blocks of instances, but all of them have a fixed
ending 'now' (time $t$). The individual models in ensembles can also be constructed using non-sequential
instance selection [151]. This technique is more suitable for gradual drift, as well as reoccurring contexts.
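
The two sequential ways of forming training sets for ensemble members can be sketched as follows. The function names and the block and window sizes are assumptions for illustration; integers stand in for labeled instances ordered in time.

```python
def sequential_blocks(history, block_size):
    """Non-overlapping blocks of instances that are consecutive in time."""
    return [history[i:i + block_size] for i in range(0, len(history), block_size)]


def nested_windows(history, sizes):
    """Training windows of different sizes, all of them ending at 'now'."""
    return [history[-size:] for size in sizes]


history = list(range(10))  # instance indices 0..9, with 9 the most recent
blocks = sequential_blocks(history, 4)
windows = nested_windows(history, (2, 5))
```

Each block would train one ensemble member that can represent an older concept, while every window contains the most recent instances and differs only in how far back it reaches.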

Another approach to building diverse base classifiers is to use the same training data, but different 
types of base learners (e.g. SVM, decision tree, Naive Bayes) [1651 H31] . 

All these techniques build individual models from what has already been seen in the past. In 
principle base classifiers can also be built adding unseen data, for instance noise or unlabeled 
testing data, which is listed as our future work. 

3.1.2 Instance weighting 

Instance weighting methods make up another group of evolving adaptation techniques. The
algorithms can consist of a single learner [85], [163], [117] or an ensemble [561 El ED], but the adaptivity
here is achieved not by combination rules, but by systematic training set formation. Ideas from
boosting [52] are often employed, giving more attention to the instances which were misclassified.
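
The idea of giving more attention to misclassified instances can be sketched with a simple multiplicative update. This is not the exact update of any particular boosting algorithm; the function name `reweight` and the factor of 2 are assumptions chosen for illustration.

```python
def reweight(weights, misclassified, factor=2.0):
    """Multiply the weight of every misclassified instance by `factor`
    and renormalise so that the weights sum to one."""
    raw = [w * (factor if miss else 1.0) for w, miss in zip(weights, misclassified)]
    total = sum(raw)
    return [w / total for w in raw]


weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, False, True, False]
new_weights = reweight(weights, misclassified)
```

After the update the misclassified instance carries twice the weight of each correctly classified one, so the next learner trained on the weighted set concentrates on it.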

3.1.3 Feature space 

There are models that manipulate the feature space to achieve adaptivity. The approach in [50] uses
ideas from transfer learning: new features, which contain information from past model performances,
are added to the training instances. In [16] the feature space is augmented by a time stamp.
In [75], [152] a dynamic feature space is used over time. In [5] the variables to observe next are
adaptively selected.






3.1.4 Base model specific 



There are also models worth mentioning, where adaptivity is achieved by managing specific model
parameters or design. In [116] a variable training window is maintained by adjusting the internal
structure of decision trees. Regression parameters are adjusted in [76]. Past support vectors are
transferred and combined with the recent training data in [145]. The latter examples illustrate the
variety of possible specific model designs.

3.2 Learners with triggers 

Another group of methods uses triggers, which determine how the models or sampling should be 
changed at a given time. 

3.2.1 Change detectors 

The most popular trigger technique is change detection, which is often implicitly related to a sudden 
drift. Change detection can be based on monitoring the raw data [T3 [ I119j . the parameters of the 
learners [1 131 or the outputs (error) of the learners [SSI 03 H14j . [H] develop change detection 
methods in each of the three categories. The detection methods usually cut the training window at
the change point, although the change point and the training window might not be the same [166].
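
A detector that monitors the outputs (errors) of a learner can be sketched as follows. The rule signals a change when the running error rate plus its standard deviation exceeds the best level seen so far by a few standard deviations. The class name `ErrorRateDetector`, the threshold of three standard deviations and the warm-up of 30 instances are assumptions for illustration and are not taken from the methods cited above.

```python
import math


class ErrorRateDetector:
    """Signals a change when the online error rate rises significantly."""

    def __init__(self, drift_level=3.0, min_instances=30):
        self.drift_level = drift_level
        self.min_instances = min_instances
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed 1 for a misclassification and 0 otherwise.
        Returns True when a change is signalled."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n                # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)   # its standard deviation
        if self.n < self.min_instances:
            return False
        if p + s < self.p_min + self.s_min:     # remember the best level so far
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()                        # start monitoring the new concept
            return True
        return False


# 10% errors for 100 instances, then the learner is wrong on every instance.
errors = [1 if i % 10 == 0 else 0 for i in range(100)] + [1] * 50
detector = ErrorRateDetector()
detections = [i for i, e in enumerate(errors) if detector.update(e)]
```

On this stream the detector raises a single alarm a few instances after position 100; a training window would then typically be cut at or near the signalled point.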

3.2.2 Training windows 

There are methods using heuristics for determining training window size |154[ |98[ HI I161| . The 
heuristics are related to error monitoring. The training window is determined using look-up table
principles, where there is an action for each possible value of a trigger. There are also base learner
specific methods for determining training windows [68], [163], [92], [147]. There the window size is
also determined based on historical accuracy.

3.2.3 Adaptive sampling 

The trigger based methods listed so far use training windows. Another group of trigger based
methods uses instance selection. The incoming testing instances (unlabeled) are inspected. Based
on the relation between the testing instance and predefined prototypes [371 EH EH H62j or historical 
training instances directly [l23[ I65| Wf\ [TO] a training set for a given instance is selected. 
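
A prototype-based variant of this selection can be sketched as below: the unlabeled target instance is assigned to its nearest prototype, and the training set consists of the historical instances that share this prototype. The function names, the fixed prototypes and the use of Euclidean distance are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_prototype(x, prototypes):
    """Index of the prototype closest to x."""
    return min(range(len(prototypes)), key=lambda j: squared_distance(x, prototypes[j]))


def training_set_for(target, history, prototypes):
    """Select the labeled instances that share the target's nearest prototype."""
    j = nearest_prototype(target, prototypes)
    return [inst for inst in history if nearest_prototype(inst[0], prototypes) == j]


prototypes = [[0.0], [5.0]]  # hypothetical, one per concept
history = [([0.0], "a"), ([0.3], "a"), ([4.8], "b"), ([5.1], "b"), ([0.2], "a")]
selected = training_set_for([4.9], history, prototypes)
```

The selected instances need not be consecutive in time, which distinguishes this strategy from the training windows of the previous subsection.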

3.3 Discussion 

In Table 1 we provide a summary of the listed algorithms. The properties are structured according
to the four design assumptions, which were discussed in Section 2. The categorization is based on
our interpretation of the methods.

Figure 7: Categorization of the related areas. AIS - Artificial Immune Systems; DBN - Dynamic
Bayesian Networks; UKD - Ubiquitous Knowledge Discovery; ART - Adaptive Resonance Theory;
CBR - Case Based Reasoning.

Change detectors and ensembles are the two most popular techniques. Change detectors are
naturally suitable for data where sudden drift is expected. Ensembles, on the other hand, are more
flexible in terms of change type, while they can be slower to react in case of a sudden drift.

We overviewed general methods for handling concept drift in supervised learning. A discussion of
specific applications will follow in Section 5. Before proceeding to applications let us look at the
broader context of learning with changing data.

4 Related Research Areas 

After reviewing the adaptive techniques for supervised learning, which were mostly developed in
the data mining and machine learning communities, we now give an interdisciplinary perspective on the
concept drift problem. In this section we point to the 'neighboring' research fields. We pick the works
which are not necessarily the 'key' references in these fields, but the ones which touch on the problem
of dataset change.

We present the research fields in three categories, which we identified as connections with the
concept drift problem: time, knowledge transfer and adaptivity. In Figure 7 we position the related
areas within these three categories. We discuss them in the following sections.
4.1 Time context



Time context in concept drifting problems means that the data is sequential in time and the models 
are also associated with time and need to be continuously updated. There are research fields focusing 
on the aspects of model update primarily for a stationary data. 

Incremental learning focuses on machine learning tasks where not all the training data is
available at once \47\ 159]. The data is received over time, thus the models need to be updated or
retrained to increase the accuracy. Schlimmer and Granger [136] introduced the assumption of
concept change in the incremental learning context.

Over the decades the incremental learning area became less active. It was gradually overtaken by data
stream mining, where the data flow is continuous and rapid [53]. Data stream mining focuses on
processing speed and complexity, thus naturally the attention toward timely change detection [101],
including anomaly detection [27], has increased.

Spatio-temporal data mining deals with database models to accommodate temporal aspects [121],
[130]. Temporal data mining [96] incorporates the time dimension into the data mining process.

Dynamic Bayesian networks are causal models assuming forward relation between the variables in 
time [29] . 

Finally, in time series analysis non-stationarity is handled using ARIMA models [22].
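
The 'I' in ARIMA stands for 'integrated': the series is differenced a number of times before a stationary ARMA model is fitted. The sketch below shows only this differencing step on toy series; the function name `difference` and the example series are assumptions for illustration.

```python
def difference(series, order=1):
    """Apply first-order differencing `order` times."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


linear_trend = [2 * t + 1 for t in range(6)]   # non-stationary mean
quadratic_trend = [t * t for t in range(6)]

diff_linear = difference(linear_trend, order=1)
diff_quadratic = difference(quadratic_trend, order=2)
```

One round of differencing removes a linear trend and two rounds remove a quadratic one, leaving a series with a constant level.
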
4.2 Knowledge transfer 

Knowledge transfer means that regularly there is a potential difference between the distribution of
the training data and the data to which the models will be applied (testing data). Thus the information
from the old data needs to be adapted to fit the new data. In the concept drift problem this
discrepancy arises in time, due to changes in the data generating process. However, a dataset shift
can have a number of other reasons, for instance, sample selection bias [11], domain shift due to
changes in measurements, model shift due to imbalance of data [125], or discrimination in decision
making [71], which are out of the scope of this thesis. In addition, the knowledge from one
problem might be transferred to solve a related one.

Case based reasoning (CBR) [35] is the process of solving new problems based on the solutions of
similar past problems. Generally CBR can be treated as lazy learning. Lazy learning does not build
generalizing models, but maintains a database of reference data and uses the relevant past data only
when a related query is made [1]. In this domain Aha [2] introduced noise tolerant instance based
algorithms; IB3 was the first instance based technique capable of handling concept drift.

A great part of lazy learning research is devoted to instance selection methods to increase accuracy.
There is another related instance selection research area (not necessarily lazy learning) aiming to
reduce the learning complexity by data reduction [128].

In machine learning the process of applying the knowledge gained on solving a similar problem
is referred to as transfer learning [133] or inductive transfer. The ideas of inductive transfer were
extended to temporal representation and used for learning under concept drift [50].

Adaptive knowledge transfer has been exploited in multitask learning [107] and learning from
multiple sources [31], [103]. The non-stationarity problem is sometimes called covariate shift in the
machine learning community P33 [70].

Finally, a field of active learning [138] is remotely related to the problem of concept drift. In active 
learning the data is labeled on demand, the methods select the instances which need to be labeled 
to make the learner more accurate or reduce labeling costs. The relation to concept drift problem 
is in the ways the methods identify, how well the unlabeled instances correspond to a particular 
concept. 

4.3 Model adaptivity 

Model adaptivity here means the models which have the properties of adaptation incorporated into 
learning. The adaptation might be to a change, as in concept drift problem. Adaptation can also 
mean the learning process (in stationary or non stationary environment), when the accuracy of the 
model is increasing along with more incoming examples. 

Artificial immune systems (AIS) are inspired by immunology [51]. They are adaptive to changes like
biological immune systems. AIS use evolutionary computation and memory to learn to recognize
changing patterns.

Adaptive resonance theory, dating back 30 years [60], is based on a model of information processing
by the brain [23]. It has self-adjusting memory as one of the desired system properties.

In evolutionary computation dynamic optimization problems are actively studied [110]. The goal
is to track the optimum, which changes dynamically in time. The major approaches are related to
maintaining and enhancing diversity, expecting that once the optimum changes, there are suitable
models available within the pool [160]. A next step in this direction is to seek a relation
between the past models and the current task [129]. In [132] a relation between the change type and
magnitude and the evolutionary algorithm is introduced.

Ubiquitous knowledge discovery (UKD) is an emerging area which focuses on learning in distributed
and mobile systems [105]. These systems operate within an environment and need to be intelligent
and adaptive. The objects of UKD systems exist in time and space in a dynamically changing
environment; they can change location and might appear or disappear. The objects have information
processing capabilities, know only their local spatio-temporal environment, act under real-time
constraints and are able to exchange information with other objects. These objects are humans,
animals, and, increasingly, computing devices.

To sum up, the problem of change is by no means limited to the data mining and machine learning
communities. The concept drift problem spans all three dimensions: the time dimension, the need for
adaptivity, and knowledge transfer.

5 Applications



In this section we survey applications where the concept drift problem is relevant, in both supervised
and unsupervised learning. We present the real life problems and discuss the sources of drift and the
actual learning tasks in the context of these problems.

We find four general types of applications: monitoring and control, personal assistance and
information, decision making, and artificial intelligence. Monitoring and control often employs
unsupervised learning to detect abnormal behavior. It includes detection of adversary activities on
the web, in computer networks, telecommunications and financial transactions. Personal assistance
and information applications include recommender systems, categorization and organization of
textual information, and customer profiling for marketing. Decision making includes diagnostics and
evaluation of creditworthiness. The 'ground truth' is usually delayed, i.e. the true answer whether
the decision was correct becomes available only after a certain time. Artificial intelligence
applications include a wide spectrum of moving and stationary systems that interact with a changing
environment, for instance robots, mobile vehicles and smart household appliances.

We define five dimensions, relevant to the applications facing concept drift: 

1. the speed of learning and output, 

2. classification or prediction accuracy, 

3. costs of mistakes, 

4. true labels, 

5. adversary activities. 

The speed of learning and output refers to the relative volume of data and how fast the decision
needs to be made. For example, in credit card fraud detection the decision needs to be fast to stop
the crime and the data loads are huge, while in credit evaluation a decision regarding the credit can
be made even in a few days' time. In both cases adversary activities to cheat the system might be
expected, while adversary activities in diagnostics would make less sense. Precise accuracy in
diagnostics is generally much more significant than in movie recommendations; moreover, in movie
recommendations the decision might be 'soft' in the sense that the viewer does not always know for
certain which movie he or she liked more.

Our global interpretation of the four types of applications in accord with these dimensions is provided
in Table 2.



Table 2: Types of applications with concept drift.

    Application type            Decision speed   Accuracy      Costs of mistakes   Labels    Adversary
 1. Monitoring & control        high             approximate   medium              hard      active
 2. Assistance & information    medium           approximate   low                 soft      low
 3. Decision making             low              precise       high                delayed   possible
 4. AI and robotics             high             precise       high                hard      low

In the following sections we discuss each of the application types separately and give arguments for
the choices we made in the table.

5.1 Monitoring and Control 

In monitoring and control applications the data volumes are large and the data need to be processed
in real time. Two types of tasks can be distinguished: prevention of and protection against adversary
actions, and monitoring for management purposes.

5.1.1 Monitoring against adversary actions 

Monitoring against adversary actions is often an unsupervised learning task or one class
classification, where the properties of 'normal behavior' are well defined, while the properties of
attacks can differ and change from case to case. Classes are typically highly imbalanced with a few
real attacks.
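
The one class setting described above can be illustrated with a minimal sketch. It assumes that
'normal behavior' is summarized by the mean and standard deviation of a single numeric feature (for
example, requests per minute); the function names, the data and the threshold of three standard
deviations are hypothetical choices made for this example and are not taken from the cited works.

```python
import statistics

def fit_normal_profile(normal_values):
    """Summarize 'normal behavior' by its mean and standard deviation."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_anomalous(value, profile, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mean, std = profile
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold

# Hypothetical observations of normal behavior (assumed data).
normal_traffic = [98, 102, 100, 97, 103, 101, 99, 100]
profile = fit_normal_profile(normal_traffic)

print(is_anomalous(101, profile))  # False: within the normal range
print(is_anomalous(250, profile))  # True: far outside the normal range
```

Only the normal class is modeled, so no attack examples are needed. Under concept drift the profile
would have to be re-estimated over time, since 'normal' behavior itself can change.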

Computer security. Intrusion detection is one of the typical monitoring problems. It is the
detection of unwanted access to computer systems, mainly through a network (e.g. the internet).
There are passive intrusion detection systems, which only detect and alert the owner, and active
systems, which take protective action. In both cases we refer here only to the detection part.

Adversary actions are the primary source of concept drift in intrusion detection. The attackers try
to invent new ways of attacking which would overcome the existing security. The secondary source
of concept drift is technological progress over time: as more advanced and powerful machines are
created, they become accessible to intruders. 'Normal' behavior can also change over time.

Lane and Brodley [91] explicitly formulated the problem of concept drift in intrusion detection a
decade ago. They presented a detection system using instance based learning. Current research
directions and open problems in intrusion detection can be found in a general review [118]. Among
supervised learning approaches, ensemble techniques have recently been proposed [104]. Artificial
immune systems are widely considered for intrusion detection [77].

Telecommunications. Adversary behavior, both intrusion and fraud, also applies to the
telecommunications industry. From a research perspective, the mobile masquerade detection problem
[!Q6j is closely related to intrusion detection. The goal is to prevent adversaries from gaining
unauthorized access to private data. The sources of concept drift are again twofold: adversary
behavior trying to overcome the control, as well as changing behavior of legitimate users. Fraud
detection and prevention in the telecommunication industry [66] is also subject to concept drift for
similar reasons.

Finance. In the financial sector data mining techniques are employed to monitor streams of financial
transactions (credit cards, internet banking) and to alert for possible frauds. Detection of insider
trading in the stock market is another application.

Both supervised and unsupervised learning techniques are used [20] for detection of fraudulent
transactions. The data labeling might be imprecise due to unnoticed frauds, legitimate transactions
might be misinterpreted, and the imbalance of the classes is very high (few frauds as compared to
legitimate actions). Concept drift in user behavior is one of the challenges.

Insider trading is trading in the stock market based on non-public information about the company;
in most countries it is prohibited by law. Inside information can come in many forms: knowledge of a
corporate takeover, a terrorist attack, unexpectedly poor earnings, the FDA's acceptance of a new
drug [40]. Insider trading disadvantages regular investors. There is a potential for concept drift,
since inside traders would try to come up with novel ways to distribute the transactions in order to
hide them.

5.1.2 Monitoring for management 

Monitoring for management usually uses streaming data from sensors. It is also characterized by 
high volumes of data and real time decision making; however, adversary cases usually are not 
present. 

Transportation. Traffic management systems use data mining to determine traffic states [32],
e.g. car density in a particular area or accidents. Traffic control centers are the end users of such
systems. Transportation systems are dynamic (always moving). The traffic patterns change
seasonally as well as permanently, thus the systems have to be able to handle concept drift.

Data mining can also be employed for prediction of public transportation travel time [109], which
is relevant for scheduling and planning. The task is also subject to concept drift due to traffic
patterns, human driver factors and irregular seasonality.

Positioning. Concept drift is also relevant in remote sensing in fixed geographic locations.
Interactive road tracking is an image understanding system to assist a cartographer annotating road
segments in aerial photographs [164]. In this problem change detection comes into play when
generalizing to different roads over time. In place recognition [102] or activity recognition [100]
the dynamics of the environment cause concept drift in the learned models.

Climate patterns, such as floods, are expected to be stationary, but the detection systems have
to incorporate contexts that reoccur irregularly. In light of climate change the systems might
benefit from adaptive techniques, for instance sliding window training [87]. In [86] the authors use
active learning of a non stationary Gaussian process for river monitoring.
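
The sliding window idea mentioned above can be sketched as follows. This is a minimal illustration
and not the method of [87]: the class name, the window size, the data and the use of a simple mean
as the base model are assumptions made for the example.

```python
from collections import deque

class SlidingWindowMeanPredictor:
    """Predicts the next value as the mean of the last `window_size` observations."""

    def __init__(self, window_size):
        # A bounded deque drops the oldest observation automatically.
        self.window = deque(maxlen=window_size)

    def update(self, value):
        self.window.append(value)

    def predict(self):
        if not self.window:
            raise ValueError("no observations yet")
        return sum(self.window) / len(self.window)

model = SlidingWindowMeanPredictor(window_size=3)
for level in [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]:  # hypothetical level shifting from 1.0 to 5.0
    model.update(level)
print(model.predict())  # 5.0: only the three most recent observations are used
```

A model trained on the full history would predict 3.0 here. The window size controls the trade-off
between fast adaptation and stability of the estimate.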

Industrial monitoring. In production monitoring the human factor can be the source of concept
drift. Consider a boiler used for heat production. The fuel feeding and burning stages might
depend on the individual habits of a boiler operator when the fuel is manually input into the system
[6]. The control task is to identify the start and end of the fuel feeding, thus algorithms should be
equipped with mechanisms to handle concept drift.

In service monitoring the changing behavior of the users can be the source of drift. For example,
data mining is used to detect accidents or defects in a telecommunication network [120]. A change in
call volumes may be the result of an increased number of people trying to call friends or family to
tell them what is happening, or of a decrease in network usage caused by people being unable to use
the network. Or the change might be unrelated to the telecommunication network at all. The fault
detection techniques have to be able to handle such anomalies.

5.2 Personal Assistance and Information 

These applications mainly organize and/or personalize the flow of information. The applications can
be categorized into individual assistance for personal use, customer profiling for business (marketing)
and public or specified information. In any case, the class labels are mostly 'soft' and the costs
of mistakes are relatively low. For example, if a movie recommendation is wrong, it is not a
disaster, and even the user himself or herself might not know for sure which of two given movies
he or she likes more.

5.2.1 Personal assistance 

Personal assistance applications deal with user modeling aiming to personalize the flow of
information, which is referred to as information filtering. A rich technical presentation on user
modeling can be found in [57]. One of the primary applications of user modeling is representation of
queries, news and blog entries with respect to current user interests. Changes in user interests over
time are the main cause of concept drift.

A large part of personal assistance applications is related to textual data. The problem of concept
drift has been addressed in news story classification [156] [15] and document categorization [99] [82]
[111]. In [75] the issue of reoccurring contexts is addressed in light of changing user interests.
Drifting user interests are relevant in building personal assistance in digital libraries [64] or
networked media organizers [48].

There is also a large body of research addressing web personalization and dynamics [158, 135, 33, 123],
which is again subject to drifting user interests. In contrast to the end user text mining discussed
before, here mostly interim system data (logs) is mined.

Finally, the concept drift problem is highly relevant for spam filtering |36|, 146] . First of all,
there are adversary actions (spamming), in contrast to the personal assistance applications listed
before. That means the senders are actively trying to overcome the filters, therefore the content
changes rapidly. Adversaries are intelligent and adaptive. Spam types are subject to seasonality and
to the popularity of topics or merchandise. There is a drift in the amount of spam over time, as well
as in the content of the classes [45]. Spam messages are disjunctive in content. Besides, personal
interpretation of what is spam might differ and change.

5.2.2 Customer profiling 

For customer profiling aggregated data from many users is mined. The goal is to segment the 
customers according to their interests. Since individual interests are changing over time, customer 
profiling algorithms should take this non stationarity into account. 



Direct marketing is one of the applications. Adaptive data mining methods are used in customer
segmentation based on product (cars) preferences [32] or service use (telecommunications) [16].
Lately, in addition to similarity measures between individual customers, social network analysis has
been employed in customer segmentation [93]. It is observed that user interests do not evolve
simultaneously. The users that used to have similar interests in the past might no longer share the
interests in the future. The authors model this as an evolving graph. Adaptivity is also relevant to
association rule mining applied to shopping basket identification and analysis [134].

Automatic recommendations can be related to both customer profiling and personal assistance.
Recommender systems are characterized by sparsity of data. For example, there are only a
few movie ratings per user, while the recommendations need to be inferred over the whole movie
pool. The publicity of recommender systems research has increased rapidly with the Netflix movie
recommendation competition. The winners used the temporal aspect as one of the keys to the problem
[84] [8]. Three sources of drift were noted: movie biases (popularity changes over time), user bias
(natural drift of users' rating scale, benchmarking to the recent ratings) and changes in user
preferences. There are earlier works on recommender systems in which changes over time were
addressed [39] via time weighting.
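
Time weighting can be illustrated with a minimal sketch in which older ratings contribute less to a
user's average rating. The exponential decay and the `half_life` parameter are assumptions made for
this example; the weighting scheme used in [39] may differ.

```python
import math

def time_weighted_mean(ratings, now, half_life):
    """Average of (timestamp, rating) pairs with exponentially decaying weights.

    A rating that is `half_life` time units old counts half as much as a new one.
    """
    decay = math.log(2) / half_life
    weights = [math.exp(-decay * (now - t)) for t, _ in ratings]
    weighted_sum = sum(w * r for w, (_, r) in zip(weights, ratings))
    return weighted_sum / sum(weights)

# Hypothetical ratings of one user: (day, rating on a 1-5 scale).
ratings = [(0, 2.0), (100, 5.0)]
print(time_weighted_mean(ratings, now=100, half_life=100))
```

With these values the old rating receives weight 0.5 and the recent one weight 1, so the result is
approximately 4.0, compared to 3.5 for the unweighted mean.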

5.2.3 Information 

Information applications are related to changes in data distribution over time, which is sometimes
referred to as virtual drift in the concept drift literature [153]. Changes in class assignment are
then called real drift. Virtual drift would typically occur over a longer period of time. For example,
in a news recommendation system, news about meat prices in New Zealand might suddenly become
relevant for Kate (real drift: the label changes, but the document comes from the same distribution
as before). It might also happen that consumers in New Zealand switch from pork to beef, so that the
distribution of articles about meat changes independently of Kate's interests (virtual drift).
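
In probabilistic terms the distinction can be written as follows, using X for an instance and y for
its label as in Section 1. The time-indexed distributions p_t are notation assumed here for
illustration.

```latex
\begin{align*}
p_t(X, y) &= p_t(X)\, p_t(y \mid X) \\
\text{virtual drift:}\quad & p_{t+1}(X) \neq p_t(X), \qquad p_{t+1}(y \mid X) = p_t(y \mid X) \\
\text{real drift:}\quad & p_{t+1}(y \mid X) \neq p_t(y \mid X)
\end{align*}
```

In the example, Kate's changing interest alters p(y | X), while the shift in what is written about
meat alters p(X).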

Document organization is the first category of information applications. Given e-mail, news 
or document streams, the task is to extract meaningful structures, organize the data into topics. 
Temporal order is necessary for making sense. The topics themselves and even the vocabulary for 
particular topics change in time. 

The state of the art Latent Dirichlet Allocation model for probabilistic document corpus modeling
was recently equipped with a time dimension [18] [149]. In [18] the dynamics of scientific topics in
articles of Science magazine from 1881 to 1999 (120 years) were analyzed: the emergence, peak and
decline of topics were shown, and the topic vocabulary representation was built. In [161] the
time stamp was incorporated into the static model. In [78] a method for organization of e-mail
messages was presented, to provide a framework for content analysis. Intuitively this is similar to
including a time feature into the original observation.

Economics. Concept drift is relevant in making macroeconomic forecasts [58] and in predicting the
phases of a business cycle |8U| . The data is drifting primarily due to a large number of influencing
factors, which cannot feasibly be included in prediction models. For the same reason financial
time series are known to be non stationary and thus difficult to predict [62].

In business management, in particular software project management, careful planning can be
inaccurate if concept drift is not taken into account. In [33] data mining models are employed for
project time prediction; the models are equipped with concept drift handling techniques.

5.3 Decision Making 

Decision making and diagnostics applications usually involve a limited amount of data (which might
be sequential or time stamped). Decisions are not required to be made in real time, thus the applied
models might be computationally expensive. However, high accuracy is essential in these applications
and the costs of mistakes are large.

Finance. Bankruptcy prediction or individual credit scoring is typically considered to be a
stationary problem [90]. However, in these problems concept drift is closely related to a hidden
context [63], that is, to changes in a context which is not observed or measured in the original
model. The need for different models for bankruptcy prediction under different economic conditions
was acknowledged, and such models proposed, in (14=4]. The need for models to be able to deal with
non stationarity has rarely been acknowledged [67]. Although the concept drift problem is present,
adversaries might exploit full adaptivity of the models. Thus offline adaptivity, restricted to
already seen subtypes of customers, is needed [165].

Biomedical applications can be subject to concept drift due to the adaptive nature of
microorganisms [139, 148]. The effect of antibiotics on a patient often naturally diminishes over
time, since microorganisms mutate and evolutionarily develop antibiotic resistance. If a patient is
treated with an antibiotic when it is not necessary, resistance might develop and antibiotics might
no longer help when they are really needed.

Clinical studies and systems need mechanisms for adapting to changes caused by human demographics
[89] 154] . The changes in disease progression can also be triggered by changes in a drug being used
[17]. In incremental drug discovery experiments the drift between training and testing sets can
be caused by non uniform sampling [49].

Data mining can be used to discover emerging resistance and monitor nosocomial infections in
hospitals (the infections which result from the treatment) [69]. Given patient and microbiology
data as an input, the task is to model the resistance. The resistance changes over time.

Finally, concept drift occurs in biometric authentication [159] [122]. The drift can be caused by
changing physiological factors, for example a growing beard. As in credit applications, adaptivity
of the algorithms should be used here with caution, due to potential adversary behavior.

5.4 AI and Robotics 

In AI applications the problem of concept drift is often called dynamic environment. The objects
learn how to interact with the environment, and since the environment is changing, the learners need
to be adaptive.

5.4.1 Mobile systems and robotics

Ubiquitous Knowledge Discovery (UKD) deals with distributed and mobile systems operating
in a complex, dynamic and unstable environment. Here 'ubiquitous' means distributed, that is,
present in many places at a time. Navigation systems, vehicle monitoring, household management
systems and music mining are examples of UKD.

The DARPA navigation challenge was presented in [146]. A winning entry in 2005 used online learning
to classify road images into drivable and not drivable. It used an adaptive mixture of Gaussians:
gradual adaptation was achieved by adjusting the internal Gaussians, and rapid adaptation by
replacing Gaussians with new ones. The required speed of adaptation depends on the road conditions.
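
The two adaptation modes can be sketched with a toy one-dimensional mixture. This is an illustration
of the general idea only, not the implementation of the winning entry: the class name, the matching
rule (within three standard deviations), the learning rate and the rule for choosing which Gaussian
to replace are all assumptions made for the example.

```python
import math

class AdaptiveGaussianMixture:
    """Toy 1-D mixture: matched Gaussians adapt slowly, unmatched samples replace one."""

    def __init__(self, means, variance=1.0, learning_rate=0.1, match_sigmas=3.0):
        # Each component is [mean, variance, number of matched samples].
        self.components = [[m, variance, 1] for m in means]
        self.initial_variance = variance
        self.learning_rate = learning_rate
        self.match_sigmas = match_sigmas

    def _matches(self, component, x):
        mean, var, _ = component
        return abs(x - mean) <= self.match_sigmas * math.sqrt(var)

    def is_known(self, x):
        return any(self._matches(c, x) for c in self.components)

    def update(self, x):
        matched = [c for c in self.components if self._matches(c, x)]
        if matched:
            # Gradual adaptation: nudge the closest matching Gaussian toward x.
            c = min(matched, key=lambda comp: abs(x - comp[0]))
            a = self.learning_rate
            diff = x - c[0]
            c[0] += a * diff
            c[1] = (1 - a) * c[1] + a * diff * diff
            c[2] += 1
            return "gradual"
        # Rapid adaptation: replace the least supported Gaussian with a new one at x.
        weakest = min(self.components, key=lambda comp: comp[2])
        weakest[:] = [x, self.initial_variance, 1]
        return "rapid"

model = AdaptiveGaussianMixture(means=[0.0, 10.0])
print(model.update(0.5))   # gradual: close to an existing Gaussian
print(model.update(50.0))  # rapid: unlike anything seen so far
```

After the second update the Gaussian that was centered at 10.0 has been replaced by one centered at
50.0, so values near 50.0 are recognized immediately while values near 10.0 no longer are.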

Adaptivity to a changing environment has been addressed in robotics [124], for instance in designing
a player for robot soccer [94].

5.4.2 Intelligent systems 

'Smart' home systems [126] or intelligent household appliances [33] also need to be adaptive to 
changing environment and user needs. 

5.4.3 Virtual reality 

Finally, virtual reality needs mechanisms to take concept drift into account. In computer game
design [28] adversary actions of the players (cheating) might be one of the drift sources. In flight
simulation the strategies and skills differ across different users [63].

In Table 3 we summarize the discussed applications with concept drift.

6 Terminology 

Concept drift is a relatively new research field and the terminology is not yet fixed. Moreover, the
problem of shifting data is discovered and handled in a very broad range of domains. With growing
loads of data, more and more attention is drawn to the differences between training and testing data
distributions. We provide alternative terminology in Table 4.

7 Concluding Remarks 

We provided an overview of the available concept drift responsive techniques and real learning tasks, 
where concept drift problem is relevant. 

Table 4: Terminology across research fields.

Research field              Terminology
Data mining                 concept drift
Machine learning            concept drift, covariate shift
Evolutionary computation    changing environment
AI and Robotics             dynamic environment
Statistics, time series     non stationarity
Databases                   concept drift, load shedding
Information retrieval       temporal evolution

The problem of concept drift is very broad. There has been plenty of general research and attempts
to understand the phenomenon. Generalization is not possible without assumptions about the nature
of the change, which depends on the data and the problem. The focus on applications has been
limited so far, possibly due to a lack of real data.

We argue that the challenges are different for different types of applications, see Table 2. How
quickly does the data change? Is it worth complicating the model? It is if we deal with a 100 year
history of Science papers. Do we want full model adaptability? Is it secure? Wouldn't a simple
training window be enough in most practical cases? Maybe focusing on selecting a proper
base model is essential?

In our opinion, a focus on specific models for specific problems is promising.